DIVISION: 2626

DETAILED ACTION 
Continued Examination Under 37 CFR 1.114

1. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37
CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for 
continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been 
timely paid, the finality of the previous Office action has been withdrawn pursuant to
37 CFR 1.114.

The Applicant's PRELIMINARY AMENDMENT, filed on February 14, 2006, has been 
entered. An action continuing examination on the merits follows. The text of those sections of 
Title 35, U.S. Code not included in this action can be found in a prior Office action. 

Claim Informalities 

2. Claim 2, and by dependency claim 3, are objected to under 37 CFR 1.75(a) because the 
meaning of the phrase "the step of analyzing" (line 2) needs clarification. Because no analyzing 
was previously recited, it may be unclear to what element this phrase refers. To further timely
prosecution and evaluate prior art, the Examiner has interpreted this phrase as --the step of
generating--.

3. Claim 9 is objected to for the same reasons as claim 2 because the limitations are recited 
using obviously similar phrases. 

4. Claim 10, and by dependency claim 11, are objected to for the same reasons as claim 2
because the limitations are recited using obviously similar phrases. 

5. Claim 15 is objected to under 37 CFR 1.75(a) because the meaning of the phrase "the step
of conversion" (line 2), and the meaning of the symbol "N" (line 2) need clarification. Because no
conversion and no definition of "N" were previously recited, it may be unclear to what element
this phrase refers. To further timely prosecution and evaluate prior art, the Examiner has
interpreted claim 15 as dependent on claim 8, because that is the nearest preceding claim to
provide sufficient antecedent basis.

6. Claim 15 is objected to for the same reasons as claim 2 because the limitations are recited 
using obviously similar phrases. 

Claim Rejections - 35 USC § 102 

Basu 

7. Claims 1, 4, 5, 9, and 14 are rejected under 35 U.S.C. 102(e) as being anticipated by Basu 
[US Patent 6,594,629], already of record. 

8. Regarding claim 14, Basu [at column 17] describes an apparatus for extracting visemes 
from an audio speech signal by describing the content and functionality of the recited limitations 
recognizable as a whole to one versed in the art as the following terminology: 

means for receiving digitized analog speech information from the audio speech signal, 
means for filtering, and means for generating [see Fig. 12, and its descriptions, especially at 
column 18, line 66-column 19, line 59, of the processor, memory, and software of the sampled 
audio (talking, speech) stream]; 

successive speech frames at a fixed rate [at column 17, lines 42-43, as audio frames
spaced 10 msec in time]; 

receiving the speech as successive speech at the fixed rate [at column 6, lines 16-18, as the 
every 10-msec advance of a segment of speech]; 

filtering each of the successive frames of digitized analog speech information to 
synchronously generate time domain frame vectors at the fixed rate [see Fig. 12, and its
descriptions, especially at column 6, lines 14-19, of the extraction process advancing segments of
sampled speech every 10 msec and extracting succeeding acoustic cepstral vectors]; 

wherein each of the vectors is derived from one of the successive frames [at column 6, 
lines 14-19, as succeeding acoustic cepstral vectors extracted from each 10-msec advance of the 
segment of speech]; 

they are classification vectors [at column 6, lines 38-40, as the probability module labeling 
the extracted vectors with phonemes]; 

synchronously generate a sequence of a set of visemes wherein each set of visemes in the 
sequence is derived for a corresponding one of the vectors [at column 17, lines 51-55, as assign 
probabilities to visemes for vectors provided to the probability module of the time instant when 
the audio frame occurs]. 

9. Claim 1 sets forth a method with limitations comprising the functionality associated with 
using the apparatus recited in claim 14. Basu describes those similar limitations as indicated 
there; accordingly, this claim also is anticipated. 

10. Regarding claim 4, Basu also describes: 

each set includes viseme identifiers [at column 13, lines 42-44 and 55-56, as visual speech 
feature vectors (visemes) labeled with phonemes]; 

each set includes confidence numbers [at column 13, lines 61-62, as combine with a 
confidence estimation that refers to a likelihood]; 

the confidence corresponds one to one [at column 13, lines 44-46, as each phoneme 
associated with visual speech feature vectors has a probability associated therewith]. 

11. Regarding claim 5, Basu describes the included claim elements by dependency as indicated
elsewhere in this Office action. Basu also describes:

the set consists of an identity of the most likely one [at column 18, lines 5 and 53-54, as
rescore the N-best list to recognize the highest likelihood]; 

it is a viseme [at column 17, lines 51-55, as assign probabilities to visemes]. 
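
The limitations mapped for claims 4 and 5 can be pictured with a minimal sketch: one set per frame that pairs each viseme identifier one to one with a confidence number, reduced to the identity of the most likely viseme. This is an illustration only, not code from Basu or from the application; the viseme labels, the function names, and the dictionary representation are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed representation: viseme identifier -> confidence number.
# The identifiers used below are hypothetical labels, not taken from any reference.
VisemeSet = Dict[str, float]


def most_likely_viseme(viseme_set: VisemeSet) -> str:
    """Return the identifier whose confidence number is highest."""
    if not viseme_set:
        raise ValueError("viseme set is empty")
    return max(viseme_set, key=viseme_set.get)


def most_likely_sequence(sets: List[VisemeSet]) -> List[str]:
    """Reduce one viseme set per frame to one viseme identity per frame."""
    return [most_likely_viseme(s) for s in sets]


# One set per frame; each identifier is paired one to one with a confidence.
example_sets = [
    {"closed_lips": 0.7, "open_jaw": 0.2, "rounded": 0.1},
    {"closed_lips": 0.1, "open_jaw": 0.6, "rounded": 0.3},
]
print(most_likely_sequence(example_sets))
```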

12. Regarding claim 9, Basu also describes: 

a spatial classification [at column 18, lines 53-55, as rescore based on video]. 

Sutton 

13. Claims 1, 2, 9, 10, and 12-14 are rejected under 35 U.S.C. 102(e) as being anticipated by 
Sutton [US Patent 6,539,354], already of record. 

14. Regarding claim 14, Sutton [at claim 36] describes an apparatus for extracting visemes 
from an audio speech signal by describing the content and functionality of the recited limitations 
recognizable as a whole to one versed in the art as the following terminology: 

means for receiving, means for filtering, and means for generating [at column 23, lines 46-56,
as computer code comprising instructions];

receiving successive frames of digitized analog speech information from the audio speech 
signal at a fixed rate [at column 19, lines 1-3, as receive an input stream in frames of a speech 
wave at a sampling rate in 10 ms frames]; 

filtering each of the successive frames to synchronously generate time domain vectors at 
the fixed rate [at column 19, lines 2-5, as compute, for each frame in 10 ms frames, a feature 
representation for each frame in 10 ms frames]; 

the vectors are frame classification vectors [at column 19, lines 5-16, as produce phoneme 
(phone) estimates using the window assembled from feature representations to have 16 10-ms 
frames, produce viseme data for the frames];

wherein each of the vectors is derived from one of the successive frames [at column 19,
line 5, as compute a feature representation for each frame]; 

synchronously generating a sequence of a set of visemes derived from the vectors [at 
column 19, lines 5-16, as assemble each feature representation into a (feature) window, produce 
phoneme (phone) estimates using the window assembled from feature representations to have 16 
10-ms frames, produce viseme data for the frames]; 

wherein each set of visemes in the sequence is derived from a corresponding one of the 
vectors [at column 26, lines 25-34, as one or more visemes active during each of the frames of a 
voice input is identified (for a phoneme) corresponding to each frame]. 

15. Claim 1 sets forth a method with limitations comprising the functionality associated with 
using the apparatus recited in claim 14. Sutton describes those similar limitations as indicated
there; accordingly, this claim also is anticipated. 

16. Regarding claim 2, Sutton describes the included claim elements by dependency as 
indicated elsewhere in this Office action. Sutton also describes: 

with a latency less than 100 msec with reference to a successive frame [at column 19, 
lines 27-29, as the latency is around 80 ms]. 

17. Regarding claim 9, Sutton also describes: 

a spatial classification [at column 19, lines 33-39, as a dedicated viseme estimator trained 
on viseme deformability to go from speech input to visemes in a single neural network]. 

18. Regarding claim 10, Sutton also describes: 

by a neural network (or other) [at column 19, lines 10-11, as include a neural network].

19. Claim 12 sets forth limitations similar to claim 14. Sutton describes the limitations as
indicated there, for a processor and software that provide the means. Sutton also describes 
additional limitations as follows: 

a processor and a memory that stores programmed instructions that control the processor 
[at column 15, lines 34-45, as a server storing all the software]. 

20. Claim 13 sets forth limitations similar to claim 14. Sutton describes the limitations as 
indicated there, for a processor and software that provide the means. Sutton also describes 
additional limitations as follows: 

a processor and a memory that stores programmed instructions that control the processor 
[at column 15, lines 34-45, as a server storing all the software]; 

a display that displays an avatar that is formed [at column 22, lines 3-17, as a display 
through which a synthesis visual output according to the method has a 3D character for reading]; 

using the set of visemes [at column 17, lines 36-43, as viseme tracks are used to render an 
animation]. 

Claim Rejections - 35 USC §103 

Basu and Thomson 

21. Claims 6-8 are rejected under 35 U.S.C. 103(a) as being unpatentable over Basu [US
Patent 6,594,629] in view of David J. Thomson, "An Overview of Multiple-Window and
Quadratic-Inverse Spectrum Estimation Methods," IEEE 1994, pp. VI-185 to VI-194, both already of
record.

22. Claim 6 includes the limitations of claim 1. Basu describes those limitations as indicated
there. Basu also describes: 

convert each frame to a spectral domain vector [at column 6, lines 20-21, as extract 
magnitudes of discrete Fourier transforms in a frame]; 

convert the spectral vectors using DCT [at column 8, lines 24-28, as transform the 
amplitude values, subsequently apply a discrete cosine transform]. 

However, Basu does not provide details of Fourier transformation to the spectral domain. 
In particular, Basu does not explicitly describe using prolate spheroid basis functions. 

Thomson [at section 2., section 8., and section 1.] examines transformation from the time 
domain to the spectral domain using the discrete Fourier transform for acoustics, speech, and 
signal processing, and Thomson describes: 

convert to a spectral domain vector using N multi-taper discrete prolate spheroid sequence
basis (MTDPSSB) functions [see Eq. (18) and its description of projecting to a frequency domain
by N-1 windows of a Slepian sequence (Discrete Prolate Spheroidal Wave Functions)];

they are factors of a Fredholm integral of the first kind [at page VI-186, column 1, as the
projection operation of spectrum estimation is a Fredholm integral of the first kind];

N is a positive integer [see Eq. (18) and its summation limits from 0 to N-1].

As indicated, Thomson shows that using N MTDPSSB functions was known to artisans at 
the time of invention. Since Thomson [at page VI-188, column 1] also points out that MTDPSSB
functions have the advantage of the best possible leakage properties for handling a dynamic range, 
it would have been obvious to one of ordinary skill in the art of converting data to the spectral 
domain at the time of invention to include the concepts described by Thomson at least using the 
MTDPSSB functions in Basu's conversion to the spectral domain because the MTDPSSB 
functions were known to have the advantage of the best possible leakage properties for handling a 
dynamic range.

23. Regarding claim 7, Basu and Thomson describe and make obvious the included claim
elements by dependency as indicated elsewhere in this Office action. Thomson also describes: 

multiplying a successive frame by one of the MTDPSSB functions to generate N product
sets of the frame [see Eq. (18) and its description, of multiplying (for windowing) the data x_n by
the N values of a Slepian sequence to generate the values of K windows];

performing a FFT of each product set to generate N FFT sets of the frame [see Eq. (18)
and its description, of the exp(-i2πf) and sum, for the FFT used for coefficient computation];

adding (change adding to combining because the addition is done to magnitude spectra
rather than separately to the real and imaginary components) together the N FFT sets of the frame
to generate a summed FFT set of the frame [see page VI-187, column 1, and the example for a
Simple Spectrum Estimate, of summing (combining) the square of the absolute value (magnitude)
of K coefficients of the expansion coefficients from Fourier transforming].

24. Regarding claim 8, Basu and Thomson describe and make obvious the included claim 
elements by dependency as indicated elsewhere in this Office action. Thomson also describes: 

scaling the summed FFT set of the successive frame(s) [see page VI-187, column 1, and 
the example for a Simple Spectrum Estimate, dividing by K the total of summing K coefficients of 
the expansion coefficients from Fourier transforming]. 
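
The sequence of operations cited for claims 7 and 8 (multiply the frame by each taper, transform each product set, combine the squared magnitudes, then scale by the number of tapers) can be sketched as follows. This is a hedged illustration, not Thomson's Eq. (18) and not the Applicant's implementation. In particular, the sine tapers below are a stand-in chosen only so the example needs no numerical library; they are NOT Slepian (discrete prolate spheroidal) sequences, and a direct DFT is used in place of an FFT, which yields the same coefficients. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def sine_tapers(length: int, count: int) -> List[List[float]]:
    """Stand-in orthonormal tapers. These are NOT Slepian sequences."""
    scale = math.sqrt(2.0 / (length + 1))
    return [
        [scale * math.sin(math.pi * (k + 1) * (n + 1) / (length + 1))
         for n in range(length)]
        for k in range(count)
    ]


def dft(values: Sequence[float]) -> List[complex]:
    """Direct discrete Fourier transform, O(N^2); an FFT computes the same values."""
    size = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * m * n / size)
            for n, v in enumerate(values))
        for m in range(size)
    ]


def multitaper_estimate(frame: Sequence[float],
                        tapers: Sequence[Sequence[float]]) -> List[float]:
    """Average of squared-magnitude spectra, one spectrum per taper."""
    summed = [0.0] * len(frame)
    for taper in tapers:
        product = [x * h for x, h in zip(frame, taper)]  # product set
        coefficients = dft(product)                      # transform of the product set
        for m, c in enumerate(coefficients):
            summed[m] += abs(c) ** 2                     # combine magnitudes, not real/imag parts
    return [s / len(tapers) for s in summed]             # scale by the number of tapers


# Usage: a 64-sample frame holding a tone at bin 8, estimated with 3 tapers.
tone = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(64)]
spectrum = multitaper_estimate(tone, sine_tapers(64, 3))
```

Combining squared magnitudes rather than complex coefficients matches the Examiner's parenthetical note on "adding" versus "combining": the per-taper spectra are summed after the phase has been discarded.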

Sutton and Peterson 

25. Claims 3 and 11 are rejected under 35 U.S.C. 103(a) as being unpatentable over Sutton
[US Patent 6,539,354] in view of Peterson et al. [US Patent 5,067,095], using the same rationale
as in a previous Office action, which is reproduced here.

26. Claim 3 includes the limitations of claim 2. Sutton describes those limitations as indicated
there. Sutton [at columns 18-19] also suggests that lower latency holds advantages; however, Sutton
does not explicitly describe latency less than 10 msec.

Like Sutton, Peterson [at column 1, lines 56-59] describes a neural network for speech
recognition, and Peterson also describes:

latency less than 10 milliseconds [at column 11, lines 27-46, as typically 20 elements and
delays of 10 microseconds (20 x 0.01 ms = 0.2 ms) to provide an output signal from the input
signal].

As indicated, Peterson shows that latency less than 10 milliseconds was known to artisans
at the time of invention. Since Peterson [at column 1, lines 44-55] also points out that neural
network processing has the inherent advantage of offering real time execution, it would have been
obvious to one of ordinary skill in the art of real time speech recognition at the time of invention
to include the concepts described by Peterson, at least latency less than 10 milliseconds, by
adjusting Sutton's neural network to a latency less than 10 milliseconds because that would
provide faster processing within whatever certain degree of error can be tolerated.

27. Claim 11 includes the limitations of claim 9. Sutton describes those limitations as
indicated there. Although Sutton describes speech recognition and viseme classification using
neural networks, Sutton does not describe details of a neural network. In particular, Sutton does
not explicitly describe a feed-forward, memory-less, perceptron type neural network.

Like Sutton, Peterson [at column 1, lines 56-59] describes a neural network for speech
recognition, and Peterson also describes: 

a neural network [see Figs. 4a and 4b and their descriptions especially at column 6, lines 
16-68, of the SPANN (sequence processing artificial neural network)]; 

feed-forward type [see Figs. 4a and 4b and their descriptions especially at column 6, lines
16-68, of signals-applied-at-the-inputs-processed-and-provided-through-the-outputs];

memory-less type [see Figs. 4a and 4b and their descriptions especially at column 6, lines
16-68, of signals-applied-processed-and-output];

perceptron type [see Figs. 4a and 4b and their descriptions especially at column 6, lines
16-68, of neurons].

As indicated, Peterson shows that a feed-forward, memory-less, perceptron type neural
network was known to artisans at the time of invention. The system by Sutton requires a neural
network, but merely any neural network from mature technologies. Sutton has not disclosed a
preferred approach to those operations according to a design criterion or solution to any stated
problem. Since it appears that the use of any neural network that is known to artisans would
perform to provide Sutton's requirement of low latency, it would have been obvious to one of
ordinary skill in the art of real time speech recognition at the time of invention to include the
concepts described by Peterson, at least a feed-forward, memory-less, perceptron type neural
network, according to Sutton's suggestion for low latency because Peterson [at column 1,
lines 44-55] indicates that would provide faster processing within whatever certain degree of
error can be tolerated.

Basu and Thomson and Peterson 

28. Claim 15 is rejected under 35 U.S.C. 103(a) as being unpatentable over Basu [US Patent
6,594,629] in view of David J. Thomson, "An Overview of Multiple-Window and
Quadratic-Inverse Spectrum Estimation Methods," IEEE 1994, pp. VI-185 to VI-194 and Peterson et al. [US
Patent 5,067,095], all already of record.

29. Claim 15 includes the limitations of claim 8, if the Examiner's assumption about 
dependency is correct. Basu and Thomson describe and make obvious the included claim 
elements by dependency as indicated elsewhere in this Office action, including Basu's speech 
frames spaced 10 msec in time. Thomson also describes:

N is less (than 5, or other) [at page VI-193, column 2, in paragraph leading to Eq. (78), as
two Slepian sequences]. 

Both Basu [at column 17, lines 1-17] and Thomson [at section 1.] suggest the desirability 
of real-time use of recognition applications. 

However, neither Basu nor Thomson explicitly describes latency less than 10 msec with 
respect to a frame with which the visemes correspond. 

Like Basu, Peterson [at column 1, lines 56-59] describes speech recognition, and Peterson
also describes:

latency less than 10 milliseconds [at column 11, lines 27-46, as typically 20 elements and
delays of 10 microseconds (20 x 0.01 ms = 0.2 ms) to provide an output signal from the input
signal].

As indicated, Peterson shows that latency less than 10 milliseconds in recognition
applications was known to artisans at the time of invention. Since Peterson [at column 1,
lines 44-55] also points out that neural network processing has the inherent advantage of offering real time
execution, it would have been obvious to one of ordinary skill in the art of real time speech
recognition at the time of invention to include the concepts described by Peterson, at least latency
less than 10 milliseconds with reference to Basu's successive 10-msec frames of speech being
processed for viseme recognition, because that would provide processing results as near to real
time as possible within whatever certain degree of error can be tolerated.

Response to Arguments

30. The prior Office action, mailed November 14, 2005, objects to the claims, and rejects
claims under 35 USC § 102 and § 103, citing Basu, Sutton, and others. The Applicant's arguments
and changes in PRELIMINARY AMENDMENT, filed February 14, 2006, have been fully
considered with the following results.

31. With respect to objection to those claims needing clarification, the changes entered by
amendment provide clear descriptions of the claimed subject matter. Accordingly, the objection is 
removed. Please see new grounds of objection. 

32. With respect to rejection of claims under 35 USC § 102 and § 103, citing Basu alone and in 
combination, the Applicant's arguments appear to be as follows: 

The Applicant's argument appears to be that the clear scope of the claimed invention that 
distinguishes from Basu is each classification vector derived from one frame of speech, and the 
vector leading to a corresponding set of visemes. This argument is not persuasive because Basu 
has successive 10-msec durations, which meet the conditions of "frame" as a term of art and 
which Basu calls "frames." The 10-msec segments advance, and Basu extracts a vector, which is
used for classification. Basu then generates a set of visemes for the vector at the time of the audio
frame. For column and line citations, see the rejection in this Office action. Column 6, lines 14-15,
uses the term "frame" to label a 25-msec segment of speech. The successive 25-msec
durations also meet the conditions of "frame" as a term of art, and Basu also applies the term
"frame" to the 25-msec durations. However, Basu [at column 17] classifies the vectors and
visemes of the 10-msec frames.
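
The two uses of "frame" discussed here, a 25-msec segment of speech and a 10-msec advance between successive segments, can be illustrated with a short sketch. It is a hedged example only: the 16 kHz sampling rate, the function name, and the list-based representation are assumptions for illustration and are not taken from Basu.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate_hz: int,
                      frame_ms: float = 25.0,
                      hop_ms: float = 10.0) -> List[List[float]]:
    """Cut a sampled signal into fixed-length segments advanced at a fixed rate."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each cover at least one sample")
    frames = []
    start = 0
    # Keep only segments that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# Usage: one second of audio at an assumed 16 kHz rate.
one_second = [float(n) for n in range(16000)]
segments = split_into_frames(one_second, 16000)
print(len(segments), len(segments[0]))
```

With these assumed values each segment holds 400 samples and successive segments start 160 samples apart, so adjacent 25-msec segments overlap while one feature vector is still produced per 10-msec advance.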

Although it is not material to the rejections and discussion of this Office action, the
Examiner believes that the Applicant's characterization of "frame" as used in the art, the
"smallest set of digitized audio samples analyzed as a group" {italics added}, is too restrictive. For
example, as shown in Basu, the smallest segment used is 10 msec, but Basu also calls a 25-msec segment a "frame".

The Applicant's arguments have been fully considered but they are not persuasive. 
Accordingly, the rejections are maintained. 

33. With respect to rejection of claims under 35 USC § 102 and § 103, citing Sutton alone and
in combination, the Applicant's arguments appear to be as follows:

The Applicant's argument appears to be that Sutton does not explicitly describe the
correlation of visemes to vector to frame when visemes are produced using the 160-msec window
of speech. This argument is not persuasive because of Sutton's embodiment of claim 36, where
the one-to-one correspondence of a viseme and other visemes to each frame is
explicit. As to the classification feature vector correspondence, there is also Sutton's general
teaching of timing at column 19 of a feature representation for each frame in 10 ms frames. For
column and line citations, see the rejection in this Office action.

The Applicant's arguments have been fully considered but they are not persuasive. 
Accordingly, the rejections are maintained. 

34. With respect to rejection of claims under 35 USC § 103, citing Thomson in combination, 
the Applicant's arguments appear to be as follows: 

The Applicant's argument appears to be that the Applicant has purposely chosen to trade
off more bin leakage than necessary with Thomson's functions in order to achieve a desired low
level of latency. Consequently, a motivation to achieve low leakage characteristics of MTDPSSB
is not appropriate. That argument is not persuasive for similar reasons that were given in the 
previous Office action, namely, that each artisan does not have to find the same benefits in the 
prior art in order to be motivated to use prior art teaching. An artisan may find the combination 
of teaching in the prior art advantageous for a different reason than the reason put forth by the 
Applicant. While the Applicant's argument here points to an advantage for low latency, it 
mistakenly relies on the premise that the prior art must teach that a particular reason is preferred 
for the combination to be obvious. As long as some motivation or suggestion to combine the 
references is provided by the prior art taken as a whole, obviousness does not require that the 
teachings be combined for the reasons contemplated by the Applicant. 

The Applicant's arguments have been fully considered but they are not persuasive. 
Accordingly, the rejections are maintained.

Conclusion

35. Any response to this action should be mailed to: 

Mail Stop Amendment 

Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

or faxed to: 

(571) 273-8300, (for both formal communications intended for entry and for 
informal or draft communications, but please label informal fax as "PROPOSED" 
or "DRAFT") 

Patent Correspondence delivered by hand or delivery services, other than the USPS, should 
be addressed as follows and brought to U.S. Patent and Trademark Office, Customer 
Service Window, Mail Stop Amendment, Randolph Building, 401 Dulany Street, 
Alexandria, VA 22314 

***************** IMPORTANT NOTICE *********************** 
The Examiner handling this application, who was assigned to Art Unit 2654, is assigned to 
DIVISION 2626 as a result of consolidation in Technology Center 2600. Please include the new 
Division in the caption or heading of any communication. Your cooperation in this matter will 
assist in the timely processing of the submission and is appreciated by the Office. 

36. Any inquiry concerning this communication or earlier communications from the examiner
should be directed to Donald L. Storm, of Division 2626, whose telephone number is
(571) 272-7614. The examiner can normally be reached on weekdays between 7:00 AM and 3:30
PM Eastern Time. If attempts to reach the examiner by telephone are unsuccessful, the
examiner's supervisor, Richemond Dorvil, can be reached at (571) 272-7602.

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Inquiries regarding the status of submissions 
relating to an application or questions on the Private PAIR system should be directed to the
Electronic Business Center (EBC) at 866-217-9197 (toll-free) or 571-272-4100 between the hours
of 6 a.m. and midnight Monday through Friday EST, or by e-mail at: ebc@uspto.gov. For general 
information about the PAIR system, see http://pair-direct.uspto.gov. If you would like assistance 
from a USPTO Customer Service Representative or access to the automated information system, 
call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 



July 11, 2006



Donald L. Storm 
Examiner, Division 2626 



